DOP: Deep Optimistic Planning with Approximate Value Function Evaluation
Authors
Abstract
Research on reinforcement learning has demonstrated promising results in manifold applications and domains. Still, efficiently learning effective robot behaviors is very difficult, due to unstructured scenarios, high uncertainties, and large state dimensionality (e.g. multi-agent systems or hyper-redundant robots). To alleviate this problem, we present DOP, a deep model-based reinforcement learning algorithm, which exploits action values to both (1) guide the exploration of the state space and (2) plan effective policies. Specifically, we exploit deep neural networks to learn Q-functions that are used to attack the curse of dimensionality during a Monte-Carlo tree search. Our algorithm, in fact, constructs upper confidence bounds on the learned value function to select actions optimistically. We implement and evaluate DOP on different scenarios: (1) a cooperative navigation problem, (2) a fetching task for a 7-DOF KUKA robot, and (3) a human-robot handover with a humanoid robot (both in simulation and on the real robot). The obtained results show the effectiveness of DOP in the chosen applications, where action values drive the exploration and reduce the computational demand of the planning process while achieving good performance.
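As an illustration of the optimistic action-selection step described in the abstract, the following Python sketch combines a learned Q-function with an upper-confidence-bound exploration bonus. It is a minimal sketch, not the authors' implementation: the names QNetwork, ucb_score, and select_action, and the linear stand-in for the deep network, are illustrative assumptions.

```python
# Minimal sketch (not the DOP implementation) of optimistic action selection
# guided by a learned Q-function: pick the action maximising Q(s, a) plus an
# upper-confidence exploration bonus. All names are illustrative assumptions.
import math

import numpy as np


class QNetwork:
    """Stand-in for a deep Q-function; here a random linear model."""

    def __init__(self, state_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(size=(num_actions, state_dim))

    def q_values(self, state):
        return self.weights @ state


def ucb_score(q_value, visits_parent, visits_action, c=1.0):
    """Upper confidence bound on the learned value; optimism drives exploration."""
    bonus = c * math.sqrt(math.log(visits_parent + 1) / (visits_action + 1))
    return q_value + bonus


def select_action(qnet, state, visit_counts, c=1.0):
    """Select the action with the highest optimistic score."""
    q = qnet.q_values(state)
    total = sum(visit_counts) + 1
    scores = [ucb_score(q[a], total, visit_counts[a], c) for a in range(len(q))]
    return int(np.argmax(scores))


if __name__ == "__main__":
    state_dim, num_actions = 4, 3
    qnet = QNetwork(state_dim, num_actions)
    state = np.ones(state_dim)
    visits = [0] * num_actions
    for _ in range(20):                      # a few optimistic selections
        a = select_action(qnet, state, visits)
        visits[a] += 1
    print("visit counts after 20 selections:", visits)
```

In a full tree search, this selection rule would be applied at every node of the search tree, so less-visited actions with promising learned values are expanded first.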
Similar references
Map-Based Strategies for Robot Navigation in Unknown Environments
Robot path planning algorithms for finding a goal in an unknown environment focus on completeness rather than optimality. In this paper, we investigate several strategies for using map information, however incomplete or approximate, to reduce the cost of the robot’s traverse. The strategies are based on optimistic, pessimistic, and average value assumptions about the unknown portions of the robo...
Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not ...
Lambda-Policy Iteration: A Review and a New Implementation
In this paper we discuss λ-policy iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value iteration (VI) and policy iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, whereby each policy evaluation is done approximately, using a finite number of VI steps. We review the theory of the method and associat...
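The following is a minimal sketch of the optimistic (modified) policy iteration idea mentioned in this abstract, where each policy evaluation is replaced by a finite number of value-iteration sweeps. The two-state MDP, its transition probabilities, and all variable names are made-up assumptions for illustration only.

```python
# Illustrative sketch of optimistic (modified) policy iteration on a tiny
# tabular MDP: each policy evaluation uses only m backup sweeps instead of
# being solved exactly. The 2-state, 2-action MDP below is a made-up example.
import numpy as np

# Transition probabilities P[a, s, s'] and rewards R[a, s].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9


def optimistic_policy_iteration(m=5, iterations=50):
    n_actions, n_states = R.shape
    V = np.zeros(n_states)
    for _ in range(iterations):
        # Greedy policy improvement with respect to the current value estimate.
        Q = R + gamma * (P @ V)              # shape (n_actions, n_states)
        policy = Q.argmax(axis=0)
        # Approximate policy evaluation: only m value-iteration-style sweeps.
        for _ in range(m):
            V = np.array([R[policy[s], s] + gamma * P[policy[s], s] @ V
                          for s in range(n_states)])
    return policy, V


if __name__ == "__main__":
    policy, V = optimistic_policy_iteration()
    print("greedy policy:", policy, "value estimate:", V)
```

Setting m = 1 recovers value iteration and letting m grow large approaches exact policy iteration, which is the sense in which the method is "intermediate" between the two.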
Bounded Approximations for Linear Multi-Objective Planning Under Uncertainty
Planning under uncertainty poses a complex problem in which multiple objectives often need to be balanced. When dealing with multiple objectives, it is often assumed that the relative importance of the objectives is known a priori. However, in practice human decision makers often find it hard to specify such preferences, and would prefer a decision support system that presents a range of possib...
Approximate Solution of the Second Order Initial Value Problem by Using Epsilon Modified Block-Pulse Function
The present work approaches the problem of obtaining the approximate solution of second order initial value problems (IVPs) via their conversion into a Volterra integral equation of the second kind (VIE2). Therefore, we initially solve the IVPs using the fourth-order Runge–Kutta method (RK), then convert them into VIE2, and apply the ε-modified block-pulse functions (εMBPFs) and their oper...
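A brief sketch of the first step mentioned in this abstract, solving a second-order IVP with the fourth-order Runge–Kutta method after reducing it to a first-order system. The test equation, step size, and function names are assumptions; the Volterra conversion and ε-modified block-pulse stage are not shown.

```python
# Hedged sketch: fourth-order Runge-Kutta for a second-order IVP, reduced to a
# first-order system y' = v, v' = f(t, y, v). The test problem y'' = -y,
# y(0) = 0, y'(0) = 1 (exact solution y = sin t) is an illustrative assumption.
import math

import numpy as np


def f(t, state):
    y, v = state
    return np.array([v, -y])              # y' = v, v' = -y


def rk4_step(t, state, h):
    """One classical RK4 step for the system state' = f(t, state)."""
    k1 = f(t, state)
    k2 = f(t + h / 2, state + h / 2 * k1)
    k3 = f(t + h / 2, state + h / 2 * k2)
    k4 = f(t + h, state + h * k3)
    return state + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


if __name__ == "__main__":
    t, h = 0.0, 0.01
    state = np.array([0.0, 1.0])          # y(0) = 0, y'(0) = 1
    while t < 1.0:
        state = rk4_step(t, state, h)
        t += h
    print(f"y(1) ~ {state[0]:.6f}, exact sin(1) = {math.sin(1.0):.6f}")
```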
Publication date: 2018